Using GEOmetadb to query datasets of interest and selecting relavent ones (dont evaluate on knit)
if(!file.exists('GEOmetadb.sqlite'))
GEOmetadb::getSQLiteFile()
con <- DBI::dbConnect(RSQLite::SQLite(),'GEOmetadb.sqlite',
synchronous = NULL)
Geo_tables <- DBI::dbListTables(con)
Geo_tables
Query datasets with high throughput seqeuncing and RNA-Seq in title. I also browsed through the GEO interactive search to find the dataset.
sql <- paste("SELECT DISTINCT gse.title,gse.gse, gpl.title,",
" gse.submission_date,",
" gse.supplementary_file",
"FROM",
" gse JOIN gse_gpl ON gse_gpl.gse=gse.gse",
" JOIN gpl ON gse_gpl.gpl=gpl.gpl",
"WHERE",
" gse.title LIKE '%RNA-Seq%' AND",
" gpl.organism LIKE '%Homo sapiens%' AND",
" gpl.technology LIKE '%High-throughput sequencing%' ",sep=" ")
result <- DBI::dbGetQuery(con,sql)
Now that the dataset of interest is chosen, retrieve the metadata and supplementary files of interest.
chosen_gse <- invisible(GEOquery::getGEO("GSE148372", GSEMatrix = FALSE))
chosen_gse_info <- data.frame(head(GEOquery::Meta(chosen_gse)))
chosen_geo <- names(GEOquery::GPLList(chosen_gse))[1]
chosen_geo <- GEOquery::Meta(GEOquery::getGEO(chosen_geo))
chosen_geo$title
## [1] "Illumina NovaSeq 6000 (Homo sapiens)"
chosen_geo$submission_date
## [1] "Mar 02 2018"
chosen_geo$last_update_date
## [1] "Nov 05 2018"
chosen_geo$organism
## [1] "Homo sapiens"
length(chosen_geo$series_id)
## [1] 7818
length(chosen_geo$sample_id)
## [1] 169121
Title : Illumina NovaSeq 6000 (Homo sapiens)
Submission data : Mar 02 2018
Last update data : Nov 05 2018
Organism : Homo sapiens
Number of GEO datasets that use this technology :
7818
Number of GEO samples that use this technology : 169121
Contact Location : USA
The dataset only has one supplementary file with counts, this code simply looks into some initial info and stores the csv dataframe in memory.
if (!exists('sfiles'))
sfiles <- GEOquery::getGEOSuppFiles('GSE148372', filter_regex = "csv")
fnames <- rownames(sfiles)
fnames
## [1] "/home/rstudio/GSE148372/GSE148372_ko.count.csv.gz"
RNAfile <- read.csv(fnames[1], check.names = FALSE)
colnames(RNAfile)[1] <- "gene"
RNAfile
colnames(RNAfile)
## [1] "gene" "HCC1806.wildtype.g2" "HCC1806.wildtype.g1"
## [4] "HCC1806.PMVKko.g1" "HCC1806.PMVKko.g2" "HCC1806.SCAPko.g1"
## [7] "HCC1806.SCAPko.g2" "HCC1806.SREBF1ko.g1" "HCC1806.SREBF1ko.g2"
## [10] "HCC1954.wildtype.g2" "HCC1954.wildtype.g1" "HCC1954.SREBF1ko.g1"
## [13] "HCC1954.SREBF1ko.g2" "JIMT1.wildtype.g2" "JIMT1.wildtype.g1"
## [16] "JIMT1.PMVKko.g1" "JIMT1.PMVKko.g2" "JIMT1.SCAPko.g1"
## [19] "JIMT1.SCAPko.g2" "JIMT1.SREBF1ko.g1" "JIMT1.SREBF1ko.g2"
## [22] "MDAMB231.wildtype.g2" "MDAMB231.wildtype.g1" "MDAMB231.SREBF1ko.g2"
dim(RNAfile)
## [1] 23686 24
This section will look into generating the HGNC symbol mapping for
each gene in the data.
Since the dataset does not have ensemble ids attached to it, the gene
identifier is used to try and match the symbols.
if(!exists('ensembl')){
ensembl <- biomaRt::useMart(biomart = "ensembl", dataset="hsapiens_gene_ensembl")
}
geneID_map <- biomaRt::getBM(attributes = c('ensembl_gene_id', 'hgnc_symbol'),
filters = 'ensembl_gene_id',
values = RNAfile$gene,
mart = ensembl)
dim(geneID_map)
## [1] 0 2
As seen above, the geneId map does not result in any symbols to match
with. Since there is not ensemble id
attached to the dataset, the gene identifiers will be used in this
report instead of HGNC symbols.
grouping <-lapply(colnames(RNAfile)[2:length(colnames(RNAfile))], function (x) {unlist(strsplit(x, split='\\.'))[1]})
RNA_matrix <- data.frame(RNAfile, check.names = F)
colnames(RNA_matrix) <- c('gene', grouping)
df <- RNA_matrix[, -1]
colnames(df) <- grouping
df <- t(rowsum(t(df), group = colnames(df), na.rm = T ))
colnames(df)
## [1] "HCC1806" "HCC1954" "JIMT1" "MDAMB231"
grouping <- data.frame(unique(grouping))
colnames(df)
## [1] "HCC1806" "HCC1954" "JIMT1" "MDAMB231"
# create DEGList and calculate normalization factors:
dge = edgeR::DGEList(counts=data.matrix(df), group=grouping)
dge = edgeR::calcNormFactors(dge, method = "TMM")
normalized_counts <- edgeR::cpm(dge)
normalized_counts <- data.frame(cbind(data.frame(normalized_counts), gene=RNAfile$gene))
normalized_counts
## Loading required package: limma
It is evident that the HCC1954 and MDAMB231 cell lines are more represented in this dataset while JIMT1 and HCC1806 are more closely represented.
par(mfrow=c(2,2))
barplot(normalized_counts$HCC1806, names.arg = normalized_counts$gene, main="HCC1806")
barplot(normalized_counts$HCC1954, names.arg = normalized_counts$gene, main="HCC1954")
barplot(normalized_counts$JIMT1, names.arg = normalized_counts$gene, main="JIMT1")
barplot(normalized_counts$MDAMB231, names.arg = normalized_counts$gene, main="MDAMB231")
MDAMB231, HCC1806, HCC1954, and JIMT1 cell lines are controls in the dataset. They are compared with infected CRISPR/Cas9-expressing versions of themselves in the dataset.
The study mentions that a large subset of deaths from cancers are attributed to metastasis. Since large scale research in this field is impractivcal the use of human cells in mouse xenografts can allow to scale this.
No, the counts dataset had already one row per gene for every cell line in the experiment.
Yes, the dataset did not come with ensemble ids and therefore had to use the gene symbols already present in the dataset for the analysis.
The dataset is not too large and did not have much variation for obvious outliers and therefore none have been removed.
Replicate columns were merged in by summing up count values.
The final dataset includes 23,686 rows of normalized counts for each of the cell lines.